Customer Personality Analysis helps a business better understand its customers and makes it easier to change a product or create a new one, supporting decision making: for example, whether to invest money in marketing a new product, how much money to invest, and who the target customers are.
The core of a customer personality analysis is getting answers to questions about who the customers are, what they buy, and how they respond to campaigns.
This project aims to apply unsupervised machine learning techniques to summarize customer segments. It works with data mining concepts such as data understanding, data preparation, modeling, evaluation, and deployment.
The dataset used in this analysis was obtained from Kaggle's Customer Personality Analysis exercise.
market_campaign <- as.data.frame(read.csv("marketing_campaign.csv"))
glimpse(market_campaign)
## Rows: 2,240
## Columns: 1
## $ ID.Year_Birth.Education.Marital_Status.Income.Kidhome.Teenhome.Dt_Customer.Recency.MntWines.MntFruits.MntMeatProducts.MntFishProducts.MntSweetProducts.MntGoldProds.NumDealsPurchases.NumWebPurchases.NumCatalogPurchases.NumStorePurchases.NumWebVisitsMonth.AcceptedCmp3.AcceptedCmp4.AcceptedCmp5.AcceptedCmp1.AcceptedCmp2.Complain.Z_CostContact.Z_Revenue.Response <chr> …
From the output, it's time to panic a bit: the data is all merged into a single column. But worry not, with the power of the tidyverse we will undo that tangled mess. Prepare to witness the magic of untangling data chaos and creating order, one spell at a time!
So, let's cast our counter-spell 🪄
As we can observe, the data is stored as a single tab-separated character column:
head(market_campaign, n = 5)
## ID.Year_Birth.Education.Marital_Status.Income.Kidhome.Teenhome.Dt_Customer.Recency.MntWines.MntFruits.MntMeatProducts.MntFishProducts.MntSweetProducts.MntGoldProds.NumDealsPurchases.NumWebPurchases.NumCatalogPurchases.NumStorePurchases.NumWebVisitsMo ...
## 1 5524\t1957\tGraduation\tSingle\t58138\t0\t0\t04-09-2012\t58\t635\t88\t546\t172\t88\t88\t3\t8\t10\t4\t7\t0\t0\t0\t0\t0\t0\t3\t11\t1
## 2 2174\t1954\tGraduation\tSingle\t46344\t1\t1\t08-03-2014\t38\t11\t1\t6\t2\t1\t6\t2\t1\t1\t2\t5\t0\t0\t0\t0\t0\t0\t3\t11\t0
## 3 4141\t1965\tGraduation\tTogether\t71613\t0\t0\t21-08-2013\t26\t426\t49\t127\t111\t21\t42\t1\t8\t2\t10\t4\t0\t0\t0\t0\t0\t0\t3\t11\t0
## 4 6182\t1984\tGraduation\tTogether\t26646\t1\t0\t10-02-2014\t26\t11\t4\t20\t10\t3\t5\t2\t2\t0\t4\t6\t0\t0\t0\t0\t0\t0\t3\t11\t0
## 5 5324\t1981\tPhD\tMarried\t58293\t1\t0\t19-01-2014\t94\t173\t43\t118\t46\t27\t15\t5\t5\t3\t6\t5\t0\t0\t0\t0\t0\t0\t3\t11\t0
So let's record the steps for this process in our Grimoire:

* Split the data into separate columns; the separator is the regex expression "\\t" (a tab)
* Convert each variable to its proper type
* Rename the dataset as new_market_campaign
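Before casting it on the real data, the counter-spell can be rehearsed on a toy column (the data frame tangled below is a made-up stand-in, not the real dataset):

```r
library(tidyr)

# A tiny glued-together column, like the one read.csv produced above
tangled <- data.frame(
  blob = c("5524\t1957\tGraduation", "2174\t1954\tGraduation"),
  stringsAsFactors = FALSE
)

# Split on the tab separator into named columns
untangled <- separate(tangled, blob,
                      into = c("ID", "Year_Birth", "Education"),
                      sep = "\\t")
```

The resulting columns are all character; type conversion is a separate step, just as in the real pipeline below.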
# Split the data in different columns
market_campaign <- market_campaign %>% separate (ID.Year_Birth.Education.Marital_Status.Income.Kidhome.Teenhome.Dt_Customer.Recency.MntWines.MntFruits.MntMeatProducts.MntFishProducts.MntSweetProducts.MntGoldProds.NumDealsPurchases.NumWebPurchases.NumCatalogPurchases.NumStorePurchases.NumWebVisitsMonth.AcceptedCmp3.AcceptedCmp4.AcceptedCmp5.AcceptedCmp1.AcceptedCmp2.Complain.Z_CostContact.Z_Revenue.Response, into = c("ID","Year_Birth","Education","Marital_Status","Income","Kidhome","Teenhome","Dt_Customer","Recency", "MntWines","MntFruits", "MntMeatProducts" , "MntFishProducts" , "MntSweetProducts","MntGoldProds", "NumDealsPurchases" , "NumWebPurchases" , "NumCatalogPurchases" , "NumStorePurchases", "NumWebVisitsMonth" , "AcceptedCmp3" , "AcceptedCmp4", "AcceptedCmp5" , "AcceptedCmp1","AcceptedCmp2", "Complain", "Z_CostContact", "Z_Revenue", "Response" ), sep = "\\t")
glimpse(market_campaign)
## Rows: 2,240
## Columns: 29
## $ ID <chr> "5524", "2174", "4141", "6182", "5324", "7446", "9…
## $ Year_Birth <chr> "1957", "1954", "1965", "1984", "1981", "1967", "1…
## $ Education <chr> "Graduation", "Graduation", "Graduation", "Graduat…
## $ Marital_Status <chr> "Single", "Single", "Together", "Together", "Marri…
## $ Income <chr> "58138", "46344", "71613", "26646", "58293", "6251…
## $ Kidhome <chr> "0", "1", "0", "1", "1", "0", "0", "1", "1", "1", …
## $ Teenhome <chr> "0", "1", "0", "0", "0", "1", "1", "0", "0", "1", …
## $ Dt_Customer <chr> "04-09-2012", "08-03-2014", "21-08-2013", "10-02-2…
## $ Recency <chr> "58", "38", "26", "26", "94", "16", "34", "32", "1…
## $ MntWines <chr> "635", "11", "426", "11", "173", "520", "235", "76…
## $ MntFruits <chr> "88", "1", "49", "4", "43", "42", "65", "10", "0",…
## $ MntMeatProducts <chr> "546", "6", "127", "20", "118", "98", "164", "56",…
## $ MntFishProducts <chr> "172", "2", "111", "10", "46", "0", "50", "3", "3"…
## $ MntSweetProducts <chr> "88", "1", "21", "3", "27", "42", "49", "1", "3", …
## $ MntGoldProds <chr> "88", "6", "42", "5", "15", "14", "27", "23", "2",…
## $ NumDealsPurchases <chr> "3", "2", "1", "2", "5", "2", "4", "2", "1", "1", …
## $ NumWebPurchases <chr> "8", "1", "8", "2", "5", "6", "7", "4", "3", "1", …
## $ NumCatalogPurchases <chr> "10", "1", "2", "0", "3", "4", "3", "0", "0", "0",…
## $ NumStorePurchases <chr> "4", "2", "10", "4", "6", "10", "7", "4", "2", "0"…
## $ NumWebVisitsMonth <chr> "7", "5", "4", "6", "5", "6", "6", "8", "9", "20",…
## $ AcceptedCmp3 <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "1", …
## $ AcceptedCmp4 <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
## $ AcceptedCmp5 <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
## $ AcceptedCmp1 <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
## $ AcceptedCmp2 <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
## $ Complain <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
## $ Z_CostContact <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", …
## $ Z_Revenue <chr> "11", "11", "11", "11", "11", "11", "11", "11", "1…
## $ Response <chr> "1", "0", "0", "0", "0", "0", "0", "0", "1", "0", …
Note that all columns are stored as character (&lt;chr&gt;), so next we convert each variable to its proper type.
# Convert var type
market_campaign <- market_campaign %>% mutate_at(vars(Year_Birth, Income, Kidhome, Teenhome, Recency,
MntWines,MntFruits,MntMeatProducts,
MntFishProducts, MntSweetProducts,MntGoldProds, NumDealsPurchases,
NumWebPurchases, NumWebVisitsMonth, NumCatalogPurchases,
NumStorePurchases, Z_CostContact, Z_Revenue, AcceptedCmp3, AcceptedCmp4,
AcceptedCmp5, AcceptedCmp1, AcceptedCmp2),
as.numeric)
market_campaign$Dt_Customer <- as.Date(market_campaign$Dt_Customer, format = "%d-%m-%Y")
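As an aside, mutate_at() is superseded in newer dplyr releases; the same conversion can be sketched with across() (the toy columns below are hypothetical stand-ins for market_campaign):

```r
library(dplyr)

# Toy frame standing in for market_campaign: character columns to convert
df <- data.frame(Income  = c("58138", "46344"),
                 Recency = c("58", "38"),
                 stringsAsFactors = FALSE)

# mutate(across(...)) is the modern equivalent of mutate_at(vars(...), ...)
df <- df %>% mutate(across(c(Income, Recency), as.numeric))
```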
summary(market_campaign)
## ID Year_Birth Education Marital_Status
## Length:2240 Min. :1893 Length:2240 Length:2240
## Class :character 1st Qu.:1959 Class :character Class :character
## Mode :character Median :1970 Mode :character Mode :character
## Mean :1969
## 3rd Qu.:1977
## Max. :1996
##
## Income Kidhome Teenhome Dt_Customer
## Min. : 1730 Min. :0.0000 Min. :0.0000 Min. :2012-07-30
## 1st Qu.: 35303 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:2013-01-16
## Median : 51382 Median :0.0000 Median :0.0000 Median :2013-07-08
## Mean : 52247 Mean :0.4442 Mean :0.5062 Mean :2013-07-10
## 3rd Qu.: 68522 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:2013-12-30
## Max. :666666 Max. :2.0000 Max. :2.0000 Max. :2014-06-29
## NA's :24
## Recency MntWines MntFruits MntMeatProducts
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.:24.00 1st Qu.: 23.75 1st Qu.: 1.0 1st Qu.: 16.0
## Median :49.00 Median : 173.50 Median : 8.0 Median : 67.0
## Mean :49.11 Mean : 303.94 Mean : 26.3 Mean : 166.9
## 3rd Qu.:74.00 3rd Qu.: 504.25 3rd Qu.: 33.0 3rd Qu.: 232.0
## Max. :99.00 Max. :1493.00 Max. :199.0 Max. :1725.0
##
## MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000
## Median : 12.00 Median : 8.00 Median : 24.00 Median : 2.000
## Mean : 37.53 Mean : 27.06 Mean : 44.02 Mean : 2.325
## 3rd Qu.: 50.00 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 3.000
## Max. :259.00 Max. :263.00 Max. :362.00 Max. :15.000
##
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 3.00 1st Qu.: 3.000
## Median : 4.000 Median : 2.000 Median : 5.00 Median : 6.000
## Mean : 4.085 Mean : 2.662 Mean : 5.79 Mean : 5.317
## 3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 8.00 3rd Qu.: 7.000
## Max. :27.000 Max. :28.000 Max. :13.00 Max. :20.000
##
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.07277 Mean :0.07455 Mean :0.07277 Mean :0.06429
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## AcceptedCmp2 Complain Z_CostContact Z_Revenue
## Min. :0.00000 Length:2240 Min. :3 Min. :11
## 1st Qu.:0.00000 Class :character 1st Qu.:3 1st Qu.:11
## Median :0.00000 Mode :character Median :3 Median :11
## Mean :0.01339 Mean :3 Mean :11
## 3rd Qu.:0.00000 3rd Qu.:3 3rd Qu.:11
## Max. :1.00000 Max. :3 Max. :11
##
## Response
## Length:2240
## Class :character
## Mode :character
##
##
##
##
At this point, 24 NAs and an outlier of 666666 in Income were identified. The NAs will simply be removed; the total number of data points after removing the rows with missing values is 2216.
market_campaign <- market_campaign %>%
filter(!is.na(Income))
From Dt_Customer we can find:
- The newest customer's enrollment date in the records: 2014-12-06
- The oldest customer's enrollment date in the records: 2012-01-08

It is interesting to count the days each customer has been on our purchase list; we will do that by creating a new feature, Days_of_register.
d1 = max(market_campaign$Dt_Customer) #taking it to be the newest customer
Days_of_register <- difftime(d1, market_campaign$Dt_Customer, units = "days")
market_campaign$Days_of_register <- as.numeric(Days_of_register)
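The Days_of_register logic can be sanity-checked on the two boundary enrollment dates shown in the summary above:

```r
# Two boundary enrollment dates taken from the summary output
dates <- as.Date(c("2014-06-29", "2012-07-30"))

d1 <- max(dates)                                        # the newest customer
days <- as.numeric(difftime(d1, dates, units = "days")) # days on the list
```

The oldest customer comes out at 699 days, matching the maximum of Days_of_register seen later in the summary.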
About the consumers, it would be interesting to derive:

* Age: extracted by subtracting Year_Birth from the year of Dt_Customer
* Total_Spent: amount spent by the customer in the various categories over the span of two years
* Living_With: out of Marital_Status, to extract the living situation of couples
* Children_Count: sum of the number of kids and teenagers
* Family_Size
* Is_Parent: to indicate parenthood status, 0 or 1
# Age calculation
data_collect_year <- format(market_campaign$Dt_Customer, "%Y") # collection year of each record
market_campaign$age <- as.integer(data_collect_year) - market_campaign$Year_Birth
# Total_Spent
market_campaign$Total_Spent <- as.integer (market_campaign$MntWines + market_campaign$MntFruits+
market_campaign$MntMeatProducts +
market_campaign$MntFishProducts +
market_campaign$MntSweetProducts +
market_campaign$MntGoldProds)
# Living_With
market_campaign <- market_campaign %>%
mutate(Living_With = case_when(
Marital_Status %in% c("Married", "Together") ~ "Partner",
Marital_Status %in% c("Absurd", "Widow", "YOLO", "Divorced", "Single") ~ "Alone",
TRUE ~ ""
))
# Children_Count
market_campaign$Children_Count <- as.numeric(market_campaign$Kidhome + market_campaign$Teenhome)
# Family_Size
market_campaign$Family_Size <- as.numeric(market_campaign$Children_Count + 1) # count the customer, not yet the partner
market_campaign <- market_campaign %>%
  mutate(Family_Size = if_else(Living_With == "Partner", Family_Size + 1, Family_Size))
# Is_Parent
market_campaign$Is_Parent <- if_else(market_campaign$Children_Count > 0, 1, 0)
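For reference, the family-related features above can also be sketched as a single mutate() pipeline (run here on a hypothetical toy frame, not the real data):

```r
library(dplyr)

# Toy rows standing in for market_campaign
toy <- data.frame(Kidhome = c(0, 1), Teenhome = c(0, 1),
                  Marital_Status = c("Single", "Married"),
                  stringsAsFactors = FALSE)

toy <- toy %>%
  mutate(
    Living_With    = if_else(Marital_Status %in% c("Married", "Together"),
                             "Partner", "Alone"),
    Children_Count = Kidhome + Teenhome,
    # household size: children + 1 (the customer) + 1 if living with a partner
    Family_Size    = Children_Count + if_else(Living_With == "Partner", 2, 1),
    Is_Parent      = if_else(Children_Count > 0, 1, 0)
  )
```

Within one mutate() call, later columns can refer to earlier ones, which keeps the derivations in a single readable block.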
From the purchase habits, it is interesting to have the total number of promotions accepted by each client.
market_campaign$Total_AcceptedCmp <- as.integer(market_campaign$AcceptedCmp3+
market_campaign$AcceptedCmp4+
market_campaign$AcceptedCmp5+
market_campaign$AcceptedCmp1+
market_campaign$AcceptedCmp2)
For better visualization, I will rename the columns that have product types in them.
# Renaming columns: strip "Mnt"/"Products"/"Prods" from the product columns (10 to 15)
prod_names <- c("MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts",
                "MntGoldProds")
new_prod_names <- gsub("Products|Prods|Mnt", "", prod_names)
colnames(market_campaign)[10:15] <- new_prod_names
All good! Data prepared and ready for the exploratory analysis. Time to dive into Middle-earth and talk to the Statistics Wizard!
freq_tab_ed <- table(market_campaign$Education)
print(freq_tab_ed)
##
## 2n Cycle Basic Graduation Master PhD
## 200 54 1116 365 481
freq_tab_marstat <- table(market_campaign$Marital_Status)
print(freq_tab_marstat)
##
## Absurd Alone Divorced Married Single Together Widow YOLO
## 2 3 232 857 471 573 76 2
summary(market_campaign)
## ID Year_Birth Education Marital_Status
## Length:2216 Min. :1893 Length:2216 Length:2216
## Class :character 1st Qu.:1959 Class :character Class :character
## Mode :character Median :1970 Mode :character Mode :character
## Mean :1969
## 3rd Qu.:1977
## Max. :1996
## Income Kidhome Teenhome Dt_Customer
## Min. : 1730 Min. :0.0000 Min. :0.0000 Min. :2012-07-30
## 1st Qu.: 35303 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:2013-01-16
## Median : 51382 Median :0.0000 Median :0.0000 Median :2013-07-08
## Mean : 52247 Mean :0.4418 Mean :0.5054 Mean :2013-07-10
## 3rd Qu.: 68522 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:2013-12-31
## Max. :666666 Max. :2.0000 Max. :2.0000 Max. :2014-06-29
## Recency Wines Fruits Meat
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.:24.00 1st Qu.: 24.0 1st Qu.: 2.00 1st Qu.: 16.0
## Median :49.00 Median : 174.5 Median : 8.00 Median : 68.0
## Mean :49.01 Mean : 305.1 Mean : 26.36 Mean : 167.0
## 3rd Qu.:74.00 3rd Qu.: 505.0 3rd Qu.: 33.00 3rd Qu.: 232.2
## Max. :99.00 Max. :1493.0 Max. :199.00 Max. :1725.0
## Fish Sweet Gold NumDealsPurchases
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000
## Median : 12.00 Median : 8.00 Median : 24.50 Median : 2.000
## Mean : 37.64 Mean : 27.03 Mean : 43.97 Mean : 2.324
## 3rd Qu.: 50.00 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 3.000
## Max. :259.00 Max. :262.00 Max. :321.00 Max. :15.000
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 3.000 1st Qu.: 3.000
## Median : 4.000 Median : 2.000 Median : 5.000 Median : 6.000
## Mean : 4.085 Mean : 2.671 Mean : 5.801 Mean : 5.319
## 3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 8.000 3rd Qu.: 7.000
## Max. :27.000 Max. :28.000 Max. :13.000 Max. :20.000
## AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.07356 Mean :0.07401 Mean :0.0731 Mean :0.06408
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.00000
## AcceptedCmp2 Complain Z_CostContact Z_Revenue
## Min. :0.00000 Length:2216 Min. :3 Min. :11
## 1st Qu.:0.00000 Class :character 1st Qu.:3 1st Qu.:11
## Median :0.00000 Mode :character Median :3 Median :11
## Mean :0.01354 Mean :3 Mean :11
## 3rd Qu.:0.00000 3rd Qu.:3 3rd Qu.:11
## Max. :1.00000 Max. :3 Max. :11
## Response Days_of_register age Total_Spent
## Length:2216 Min. : 0.0 Min. : 16.00 Min. : 5.0
## Class :character 1st Qu.:180.0 1st Qu.: 36.00 1st Qu.: 69.0
## Mode :character Median :355.5 Median : 43.00 Median : 396.5
## Mean :353.5 Mean : 44.21 Mean : 607.1
## 3rd Qu.:529.0 3rd Qu.: 54.00 3rd Qu.:1048.0
## Max. :699.0 Max. :121.00 Max. :2525.0
## Living_With Children_Count Family_Size Is_Parent
## Length:2216 Min. :0.0000 Min. :1.000 Min. :0.0000
## Class :character 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:0.0000
## Mode :character Median :1.0000 Median :3.000 Median :1.0000
## Mean :0.9472 Mean :2.593 Mean :0.7144
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :3.0000 Max. :5.000 Max. :1.0000
## Total_AcceptedCmp
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2983
## 3rd Qu.:0.0000
## Max. :4.0000
Note that at this point some outliers were identified: for Income the value 666666 is very high, as is the Age of 121, of which we have two records. In both cases it is not possible to determine whether the numbers were real inputs or errors during data collection. These outliers also have a big impact on the mean. Before deciding how to deal with them, let's plot a correlation graphic and boxplots.
# Boxplot for the first 10 variables
ggplotly(
market_campaign[,c(5:7,9:15)] %>%
melt() %>%
ggplot(aes(x = variable, y = value, fill = variable)) +
geom_boxplot() +
geom_point(alpha = 0.5) +
labs(x = "Variable",
y = "Value") +
scale_fill_manual("Legend:",
values = rainbow(n = 36)) +
theme_dark()
)
## No id variables; using all as measure variables
# Boxplot for variables 16 to 20
ggplotly(
market_campaign[,c(16:20)] %>%
melt() %>%
ggplot(aes(x = variable, y = value, fill = variable)) +
geom_boxplot() +
geom_point(alpha = 0.5) +
labs(x = "Variable",
y = "Value") +
scale_fill_manual("Legend:",
values = rainbow(n = 36)) +
theme_dark()
)
## No id variables; using all as measure variables
ggplotly(
  market_campaign[,c(20,31,34,35)] %>%
    melt() %>%
    ggplot(aes(x = variable, y = value, fill = variable)) +
    geom_boxplot() +
    geom_point(alpha = 0.5) +
    labs(x = "Variable",
         y = "Value") +
    scale_fill_manual("Legend:",
                      values = rainbow(n = 36)) +
    theme_dark()
)
## No id variables; using all as measure variables
ggplotly(
  market_campaign[,c(30,32)] %>%
    melt() %>%
    ggplot(aes(x = variable, y = value, fill = variable)) +
    geom_boxplot() +
    geom_point(alpha = 0.5) +
    labs(x = "Variable",
         y = "Value") +
    scale_fill_manual("Legend:",
                      values = rainbow(n = 36)) +
    theme_dark()
)
## No id variables; using all as measure variables
Clearly, there are a few outliers in the Income and Age features. In this project they will be managed by deleting the outlying rows.
market_campaign <- market_campaign %>%
  filter(Income < 600000 & age < 114) # `&` keeps only rows inside BOTH limits; with `|` a row would be dropped only if it exceeded both limits at once
print(paste0("The total number of data-points after removing the outliers is: ", nrow(market_campaign)))
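Combining outlier conditions deserves care: a filter joined with | drops a row only when it fails every condition at once. A toy sketch of the difference (the values below are made up):

```r
library(dplyr)

# One income outlier and one age outlier, each extreme on only one feature
toy <- data.frame(Income = c(50000, 666666, 40000),
                  age    = c(40, 45, 121))

kept_and <- toy %>% filter(Income < 600000 & age < 114) # drops both outlier rows
kept_or  <- toy %>% filter(Income < 600000 | age < 114) # drops nothing here
```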
Evans (1996) suggests, for the absolute value of r: 0.00-0.19 "very weak", 0.20-0.39 "weak", 0.40-0.59 "moderate", 0.60-0.79 "strong", and 0.80-1.00 "very strong".
# Correlation matrix
matrix_1 <- market_campaign[, c(5, 9:20, 30:32, 34,35)]
cor_matrix <- cor(matrix_1)
# Plot the correlation matrix
corrplot(cor_matrix, method = "color")
matrix_2 <- market_campaign[, c(5, 9:16)]
pairs.panels(matrix_2)
matrix_3 <- market_campaign[, c( 17:20 ,30:32, 34,35)]
pairs.panels(matrix_3)
From the correlation matrix it is possible to note some strong correlations: for example, Total_Spent with Wines and Meat, Meat with NumCatalogPurchases, and Children_Count with Family_Size and Is_Parent.
Categorical encoding is the process of converting categorical data into numerical data so that a machine learning algorithm can work with it. It simply converts categories to numbers. The most widely used techniques are:

We apply One-Hot Encoding when:

The categorical feature is not ordinal and the number of categories is not large. In one-hot encoding, each category is mutually exclusive. For example, "Red" may be encoded as [1, 0, 0], "Green" as [0, 1, 0], and "Blue" as [0, 0, 1].
We apply Label Encoding when:

The categorical feature is ordinal (like Jr. kg, Sr. kg, primary school, high school) and the number of categories is quite large, since one-hot encoding could then lead to high memory consumption.
It preserves the ordinal relationship between categories, if present. For example, "Red" may be encoded as 1, "Green" as 2, and "Blue" as 3.

We apply Ordinal Encoding when:

The order or rank of the categories should be considered; it is similar to label encoding but with an explicitly chosen order. For example, "Ocean" may be encoded as 1, "Sea" as 2, and "Coast" as 3.

Although label encoding is space-efficient, it may introduce an arbitrary order to categorical values. One-hot encoding avoids this issue by creating binary columns for each category, but it can lead to high-dimensional data.
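The three encodings can be illustrated in base R; note that factor() assigns integer codes alphabetically unless an explicit level order is supplied:

```r
colors <- c("Red", "Green", "Blue")

# Label encoding: integer codes in alphabetical level order (Blue=1, Green=2, Red=3)
label_enc <- as.numeric(factor(colors))

# Ordinal encoding: codes follow the supplied rank (Red=1, Green=2, Blue=3)
ordinal_enc <- as.numeric(factor(colors, levels = c("Red", "Green", "Blue")))

# One-hot encoding: one binary column per category
one_hot <- model.matrix(~ colors - 1)
```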
For the categorical variables in the present project it is possible to use ordinal encoding for Education and label encoding for Marital_Status and Living_With. Or, as there is no need to preserve a categorical rank, we can simply apply label encoding to all variables except Living_With, which can be handled by one-hot encoding.
Thus, label encoding will be applied to Education and Marital_Status, and one-hot encoding to Living_With.
# Changing Education for numeric label
market_campaign$Education <- as.numeric(factor(market_campaign$Education))
# Changing Marital_Status for numeric label
market_campaign$Marital_Status <- as.numeric(factor(market_campaign$Marital_Status))
market_campaign <- dummy_cols(market_campaign, select_columns = "Living_With", remove_selected_columns = TRUE)
market_campaign <- market_campaign[, -c(37,38)] # keep only the Living_With_Partner dummy
market_campaign <- market_campaign %>% mutate_at(vars(Complain, Response, Living_With_Partner), as.factor)
Note that the Education labels are: 2n Cycle: 1, Basic: 2, Graduation: 3, Master: 4, PhD: 5.
The labels for Marital_Status are: Absurd: 1, Alone: 2, Divorced: 3, Married: 4, Single: 5, Together: 6, Widow: 7, YOLO: 8.
A new variable called Living_With_Partner was created, where Yes: 1 and No: 0.
market_campaign <- market_campaign[, -c(1,2,4,6:8,21:25,27,28)]
head(market_campaign)
## Education Income Recency Wines Fruits Meat Fish Sweet Gold NumDealsPurchases
## 1 3 58138 58 635 88 546 172 88 88 3
## 2 3 46344 38 11 1 6 2 1 6 2
## 3 3 71613 26 426 49 127 111 21 42 1
## 4 3 26646 26 11 4 20 10 3 5 2
## 5 5 58293 94 173 43 118 46 27 15 5
## 6 4 62513 16 520 42 98 0 42 14 2
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## 1 8 10 4 7
## 2 1 1 2 5
## 3 8 2 10 4
## 4 2 0 4 6
## 5 5 3 6 5
## 6 6 4 10 6
## Complain Response Days_of_register age Total_Spent Children_Count Family_Size
## 1 0 1 663 55 1617 0 1
## 2 0 0 113 60 27 2 3
## 3 0 0 312 48 776 0 2
## 4 0 0 139 30 53 1 3
## 5 0 0 161 33 422 1 3
## 6 0 0 293 46 716 1 3
## Is_Parent Total_AcceptedCmp Living_With_Partner
## 1 0 0 0
## 2 1 0 0
## 3 0 0 1
## 4 1 0 1
## 5 1 0 1
## 6 1 0 1
After cleaning all the unnecessary data we are still left with 24 variables, many of them correlated. As that song from Depeche Mode would say, "It's no good!", but instead of "waiting patiently" for the data to get smaller by itself, it is possible to perform dimensionality reduction.
Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. And it is here that the magic of modeling starts, with PCA (Principal Component Analysis).
#### Steps to run the PCA:

1. Standardize the d-dimensional dataset.
2. Check whether it is adequate to conduct the PCA using Bartlett's Test of Sphericity.
3. Construct the covariance matrix.
4. Decompose the covariance matrix into its eigenvectors and eigenvalues.
5. Sort the eigenvalues in decreasing order to rank the corresponding eigenvectors.
6. Select k eigenvectors, which correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace (k ≤ d).
7. Construct a projection matrix, W, from the "top" k eigenvectors.
8. Transform the d-dimensional input dataset using the projection matrix to obtain the new k-dimensional feature subspace.
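These steps can be sketched in base R on random toy data; prcomp() performs the equivalent computation internally (via SVD), so it serves as a cross-check:

```r
set.seed(42)
X <- scale(matrix(rnorm(100 * 4), ncol = 4))  # step 1: standardized d = 4 data

S   <- cov(X)        # step 3: covariance matrix
eig <- eigen(S)      # steps 4-5: eigen-decomposition, sorted by decreasing eigenvalue

k <- 2
W <- eig$vectors[, 1:k]  # steps 6-7: projection matrix from the top-k eigenvectors
X_pca <- X %*% W         # step 8: the new k-dimensional feature subspace

# Cross-check: scores agree with prcomp up to the sign of each component
pc <- prcomp(X, center = FALSE, scale. = FALSE)
```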
Before starting, it is necessary to standardize the dataset.
# Standardize data
market_campaign <- market_campaign %>% mutate(across(everything(), as.numeric))
market_campaign_std <- market_campaign[, -c(1)] %>%
  mutate(across(everything(), scale)) %>%
  as.matrix()
Before performing a data reduction technique such as principal component analysis or factor analysis, it is necessary to verify that the reduction can compress the data without losing meaningful variables.
Bartlett's Test of Sphericity compares an observed correlation matrix to the identity matrix and checks whether there is redundancy between the variables.
The null hypothesis (H0) of the test is that the variables are orthogonal, i.e. not correlated. The alternative hypothesis (H1) is that the variables are not orthogonal, i.e. they are correlated enough that the correlation matrix diverges significantly from the identity matrix.
H0: equal to the identity matrix (p-value > alpha)
H1: differs from the identity matrix (p-value < alpha)
Note: Bartlett's test must be performed on the original data rather than on the standardized version.
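The statistic itself is simple enough to sketch in base R (packages such as psych also provide it as cortest.bartlett); here it runs on a few strongly correlated mtcars columns standing in for market_campaign:

```r
# Bartlett's Test of Sphericity: chi-square test on the determinant of the
# correlation matrix; a small determinant means strong correlations.
bartlett_sphericity <- function(data) {
  R <- cor(data)
  n <- nrow(data)
  p <- ncol(data)
  chi2 <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))  # test statistic
  df   <- p * (p - 1) / 2
  list(chisq = chi2, df = df,
       p.value = pchisq(chi2, df, lower.tail = FALSE))
}

# Toy run: expect a tiny p-value, i.e. reject H0, so data reduction is adequate
res <- bartlett_sphericity(mtcars[, c("mpg", "disp", "hp", "wt")])
```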
# Bartlett's Test
correl_matrix <- cor(market_campaign)
correl_matrix
## Education Income Recency Wines
## Education 1.00000000 0.120692359 -0.0114182027 0.197886295
## Income 0.12069236 1.000000000 -0.0039697555 0.578649750
## Recency -0.01141820 -0.003969756 1.0000000000 0.015721019
## Wines 0.19788630 0.578649750 0.0157210194 1.000000000
## Fruits -0.08246221 0.430841681 -0.0058437499 0.387023861
## Meat 0.03996094 0.584633357 0.0225176351 0.568860003
## Fish -0.11474748 0.438871359 0.0005509232 0.397721050
## Sweet -0.10727879 0.440743792 0.0251097703 0.390325802
## Gold -0.09708394 0.325916446 0.0176626377 0.392730993
## NumDealsPurchases 0.02620787 -0.083100896 0.0021154508 0.008885929
## NumWebPurchases 0.08242518 0.387877811 -0.0056408538 0.553785939
## NumCatalogPurchases 0.06904876 0.589162442 0.0240814076 0.634752741
## NumStorePurchases 0.06779203 0.529362140 -0.0004338266 0.640011908
## NumWebVisitsMonth -0.04082162 -0.553088012 -0.0185636434 -0.321977901
## Complain -0.05086296 -0.027224512 0.0136366703 -0.039470211
## Response 0.09080602 0.133046664 -0.1997663693 0.246298957
## Days_of_register -0.04890128 -0.018530777 0.0259625870 0.168049485
## age 0.17306509 0.162556855 0.0146998145 0.150105920
## Total_Spent 0.09406922 0.667576090 0.0200656546 0.893135723
## Children_Count 0.05473709 -0.293351925 0.0182900870 -0.353747647
## Family_Size 0.03757485 -0.240147877 0.0144021277 -0.296388165
## Is_Parent 0.02344136 -0.338153413 0.0024851549 -0.343094115
## Total_AcceptedCmp 0.03760444 0.308381090 -0.0136471809 0.510832559
## Living_With_Partner -0.01456363 0.004663398 -0.0013710486 -0.007243788
## Fruits Meat Fish Sweet
## Education -0.082462215 0.03996094 -0.1147474825 -0.10727879
## Income 0.430841681 0.58463336 0.4388713595 0.44074379
## Recency -0.005843750 0.02251764 0.0005509232 0.02510977
## Wines 0.387023861 0.56886000 0.3977210502 0.39032580
## Fruits 1.000000000 0.54782217 0.5934310503 0.57160606
## Meat 0.547822166 1.00000000 0.5735740153 0.53513611
## Fish 0.593431050 0.57357402 1.0000000000 0.58386696
## Sweet 0.571606063 0.53513611 0.5838669550 1.00000000
## Gold 0.396486924 0.35944628 0.4271420401 0.35744975
## NumDealsPurchases -0.134512099 -0.12130771 -0.1432410856 -0.12143193
## NumWebPurchases 0.302038849 0.30709037 0.2996875104 0.33393722
## NumCatalogPurchases 0.486263071 0.73412660 0.5327567837 0.49513582
## NumStorePurchases 0.458491031 0.48600555 0.4577450432 0.45522516
## NumWebVisitsMonth -0.418728932 -0.53948442 -0.4464232918 -0.42237080
## Complain -0.005324099 -0.02378194 -0.0212202304 -0.02264120
## Response 0.122442679 0.23774642 0.1081451099 0.11617037
## Days_of_register 0.067957968 0.08957703 0.0818344445 0.08098580
## age 0.014556100 0.02910979 0.0364573564 0.01593226
## Total_Spent 0.613248760 0.84588420 0.6423707848 0.60706155
## Children_Count -0.395900937 -0.50454471 -0.4278407939 -0.38941066
## Family_Size -0.341153564 -0.42959246 -0.3633413615 -0.33048169
## Is_Parent -0.411962985 -0.57493054 -0.4503175284 -0.40272161
## Total_AcceptedCmp 0.157090225 0.30718976 0.1761107066 0.20148510
## Living_With_Partner -0.025987882 -0.02331328 -0.0179863528 -0.01594848
## Gold NumDealsPurchases NumWebPurchases
## Education -0.09708394 0.0262078719 0.082425179
## Income 0.32591645 -0.0831008957 0.387877811
## Recency 0.01766264 0.0021154508 -0.005640854
## Wines 0.39273099 0.0088859288 0.553785939
## Fruits 0.39648692 -0.1345120994 0.302038849
## Meat 0.35944628 -0.1213077141 0.307090366
## Fish 0.42714204 -0.1432410856 0.299687510
## Sweet 0.35744975 -0.1214319277 0.333937217
## Gold 1.00000000 0.0519048294 0.407065666
## NumDealsPurchases 0.05190483 1.0000000000 0.241440318
## NumWebPurchases 0.40706567 0.2414403183 1.000000000
## NumCatalogPurchases 0.44242825 -0.0121184280 0.386867640
## NumStorePurchases 0.38918017 0.0661065938 0.516240183
## NumWebVisitsMonth -0.24769056 0.3460483800 -0.051226263
## Complain -0.03113346 0.0004972467 -0.016641779
## Response 0.14033164 0.0034510733 0.151431233
## Days_of_register 0.16050486 0.2184982775 0.192762503
## age 0.05580780 0.0479144530 0.142887919
## Total_Spent 0.52870784 -0.0658538502 0.528973336
## Children_Count -0.26891799 0.4360758116 -0.148870808
## Family_Size -0.23584611 0.3744637483 -0.121295776
## Is_Parent -0.24743268 0.3884245096 -0.074007634
## Total_AcceptedCmp 0.19682192 -0.1257271367 0.196823395
## Living_With_Partner -0.02544210 0.0261473157 0.003454789
## NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## Education 0.069048763 0.0677920332 -0.040821622
## Income 0.589162442 0.5293621403 -0.553088012
## Recency 0.024081408 -0.0004338266 -0.018563643
## Wines 0.634752741 0.6400119079 -0.321977901
## Fruits 0.486263071 0.4584910315 -0.418728932
## Meat 0.734126598 0.4860055453 -0.539484417
## Fish 0.532756784 0.4577450432 -0.446423292
## Sweet 0.495135818 0.4552251636 -0.422370804
## Gold 0.442428252 0.3891801722 -0.247690557
## NumDealsPurchases -0.012118428 0.0661065938 0.346048380
## NumWebPurchases 0.386867640 0.5162401827 -0.051226263
## NumCatalogPurchases 1.000000000 0.5178404511 -0.522003774
## NumStorePurchases 0.517840451 1.0000000000 -0.432398257
## NumWebVisitsMonth -0.522003774 -0.4323982573 1.000000000
## Complain -0.020839191 -0.0169407070 0.019785006
## Response 0.219913612 0.0362411292 -0.002208954
## Days_of_register 0.097610606 0.1126184473 0.276016404
## age 0.116472244 0.1219227241 -0.137955315
## Total_Spent 0.780481781 0.6751811033 -0.499081969
## Children_Count -0.443473975 -0.3232127180 0.416076476
## Family_Size -0.371974321 -0.2649185749 0.345705485
## Is_Parent -0.453470080 -0.2849271911 0.476234070
## Total_AcceptedCmp 0.346220700 0.2027434081 -0.165502004
## Living_With_Partner -0.009854837 0.0045217548 0.003021724
## Complain Response Days_of_register age
## Education -0.0508629608 0.090806018 -0.0489012830 0.173065089
## Income -0.0272245123 0.133046664 -0.0185307766 0.162556855
## Recency 0.0136366703 -0.199766369 0.0259625870 0.014699814
## Wines -0.0394702112 0.246298957 0.1680494850 0.150105920
## Fruits -0.0053240986 0.122442679 0.0679579682 0.014556100
## Meat -0.0237819441 0.237746418 0.0895770333 0.029109787
## Fish -0.0212202304 0.108145110 0.0818344445 0.036457356
## Sweet -0.0226412002 0.116170373 0.0809858011 0.015932264
## Gold -0.0311334593 0.140331644 0.1605048638 0.055807800
## NumDealsPurchases 0.0004972467 0.003451073 0.2184982775 0.047914453
## NumWebPurchases -0.0166417790 0.151431233 0.1927625029 0.142887919
## NumCatalogPurchases -0.0208391906 0.219913612 0.0976106064 0.116472244
## NumStorePurchases -0.0169407070 0.036241129 0.1126184473 0.121922724
## NumWebVisitsMonth 0.0197850059 -0.002208954 0.2760164044 -0.137955315
## Complain 1.0000000000 -0.002029294 0.0332790490 0.028916787
## Response -0.0020292937 1.000000000 0.1964793692 -0.033435733
## Days_of_register 0.0332790490 0.196479369 1.0000000000 -0.072164180
## age 0.0289167872 -0.033435733 -0.0721641799 1.000000000
## Total_Spent -0.0374276177 0.264127177 0.1585771047 0.104967891
## Children_Count 0.0317737462 -0.167648413 -0.0259413338 0.088891113
## Family_Size 0.0234414710 -0.217986778 -0.0271874589 0.072749508
## Is_Parent 0.0206080291 -0.203741851 0.0008881036 -0.012159163
## Total_AcceptedCmp -0.0223880678 0.427123781 -0.0120094971 0.003565142
## Living_With_Partner -0.0053686489 -0.150161163 -0.0108512123 -0.001450542
## Total_Spent Children_Count Family_Size Is_Parent
## Education 0.09406922 0.05473709 0.03757485 0.0234413637
## Income 0.66757609 -0.29335192 -0.24014788 -0.3381534132
## Recency 0.02006565 0.01829009 0.01440213 0.0024851549
## Wines 0.89313572 -0.35374765 -0.29638817 -0.3430941150
## Fruits 0.61324876 -0.39590094 -0.34115356 -0.4119629845
## Meat 0.84588420 -0.50454471 -0.42959246 -0.5749305420
## Fish 0.64237078 -0.42784079 -0.36334136 -0.4503175284
## Sweet 0.60706155 -0.38941066 -0.33048169 -0.4027216054
## Gold 0.52870784 -0.26891799 -0.23584611 -0.2474326815
## NumDealsPurchases -0.06585385 0.43607581 0.37446375 0.3884245096
## NumWebPurchases 0.52897334 -0.14887081 -0.12129578 -0.0740076339
## NumCatalogPurchases 0.78048178 -0.44347397 -0.37197432 -0.4534700798
## NumStorePurchases 0.67518110 -0.32321272 -0.26491857 -0.2849271911
## NumWebVisitsMonth -0.49908197 0.41607648 0.34570548 0.4762340699
## Complain -0.03742762 0.03177375 0.02344147 0.0206080291
## Response 0.26412718 -0.16764841 -0.21798678 -0.2037418514
## Days_of_register 0.15857710 -0.02594133 -0.02718746 0.0008881036
## age 0.10496789 0.08889111 0.07274951 -0.0121591625
## Total_Spent 1.00000000 -0.50024427 -0.42394055 -0.5226292071
## Children_Count -0.50024427 1.00000000 0.84932612 0.7998054843
## Family_Size -0.42394055 0.84932612 1.00000000 0.6917811150
## Is_Parent -0.52262921 0.79980548 0.69178111 1.0000000000
## Total_AcceptedCmp 0.45709549 -0.24589113 -0.20302825 -0.2797884829
## Living_With_Partner -0.01934746 0.04219466 0.56323540 0.0573789899
## Total_AcceptedCmp Living_With_Partner
## Education 0.0376044410 -0.0145636347
## Income 0.3083810900 0.0046633983
## Recency -0.0136471809 -0.0013710486
## Wines 0.5108325594 -0.0072437882
## Fruits 0.1570902246 -0.0259878817
## Meat 0.3071897637 -0.0233132823
## Fish 0.1761107066 -0.0179863528
## Sweet 0.2014851017 -0.0159484810
## Gold 0.1968219238 -0.0254420990
## NumDealsPurchases -0.1257271367 0.0261473157
## NumWebPurchases 0.1968233948 0.0034547889
## NumCatalogPurchases 0.3462206999 -0.0098548365
## NumStorePurchases 0.2027434081 0.0045217548
## NumWebVisitsMonth -0.1655020044 0.0030217241
## Complain -0.0223880678 -0.0053686489
## Response 0.4271237810 -0.1501611634
## Days_of_register -0.0120094971 -0.0108512123
## age 0.0035651417 -0.0014505421
## Total_Spent 0.4570954896 -0.0193474565
## Children_Count -0.2458911345 0.0421946621
## Family_Size -0.2030282527 0.5632354034
## Is_Parent -0.2797884829 0.0573789899
## Total_AcceptedCmp 1.0000000000 0.0006280779
## Living_With_Partner 0.0006280779 1.0000000000
rho <- cortest.bartlett((correl_matrix))
## Warning in cortest.bartlett((correl_matrix)): n not specified, 100 used
rho
## $chisq
## [1] 7558.289
##
## $p.value
## [1] 0
##
## $df
## [1] 276
For the current dataset, Bartlett's χ² = 7558.289 with 276 degrees of
freedom and α = 0.05, giving p-value = 0; the dataset is therefore
suitable for a data reduction technique.
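The arithmetic behind `cortest.bartlett()` can be sketched in plain Python. This is a toy 2-variable example, not the project's data; the statistic is χ² = -((n-1) - (2p+5)/6)·ln|R| with p(p-1)/2 degrees of freedom:

```python
import math

def bartlett_sphericity(det_R, n, p):
    """Bartlett's test of sphericity for a correlation matrix with
    determinant det_R, sample size n, and p variables."""
    chisq = -((n - 1) - (2 * p + 5) / 6) * math.log(det_R)
    df = p * (p - 1) // 2
    return chisq, df

# Toy 2x2 correlation matrix [[1, .6], [.6, 1]]: det = 1 - .6**2 = .64
chisq, df = bartlett_sphericity(det_R=0.64, n=100, p=2)
print(round(chisq, 2), df)  # -> 43.51 1
```

A χ² of 43.51 on 1 degree of freedom is far beyond the 0.05 critical value, so even this toy matrix would reject sphericity.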
# Factors
fact_1 <- prcomp(market_campaign_std)
fviz_pca_var(fact_1, col.var="steelblue")
The cumulative variance of the first two principal components equals
0.443, so more Principal Components may be needed to explain enough
variance. We will determine an adequate number of PCs with the Kaiser
Criterion.
# Eigenvalues
eigenvalues <- round(as.numeric(fact_1$sdev^2))
print(k <- sum(eigenvalues))
## [1] 21
shared_variance <- as.data.frame (eigenvalues/k) %>%
slice(1:21) %>%
melt()%>%
mutate(PC = paste0("PC", row_number())) %>%
rename(Shared_Variance = 1)
## No id variables; using all as measure variables
shared_variance %>%
melt()%>%
ggplot(aes(x = PC, y = value, fill = variable))+
geom_col(col= "grey30", fill = "grey39")+
geom_text(aes(label = paste0(round(value * 100, 2), "%")), col = "black", vjust = -0.3, size = 2)+
labs(x = "PC", y = "Shared Variance") +
theme_gray(base_size =8)
## Using Shared_Variance, PC as id variables
At this step the sum of the eigenvalues is 21, and we have too many
components to tell apart. The chart also shows that 11 PCs contribute
too little to the variance. A factor extracted from an eigenvalue
smaller than 1 usually does not represent the behaviour of an original
variable; exceptions are rare, and tend to occur for values slightly
below 1. Keeping only the eigenvalues > 1 is known as the Kaiser
Criterion (a.k.a. the Latent Root Criterion).
# Kaiser Criterion
k <- sum(eigenvalues > 1)
print(k)
## [1] 3
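The Kaiser count itself is a one-liner; a minimal Python sketch with hypothetical eigenvalues (summing to 21, mirroring the output above, but not the fitted values):

```python
# Hypothetical eigenvalues that sum to 21, as in the R output above
eigenvalues = [7.2, 2.4, 1.3, 0.9, 0.7] + [0.5] * 17
k = sum(e > 1 for e in eigenvalues)                   # Kaiser criterion
shared = [e / sum(eigenvalues) for e in eigenvalues]  # variance shares
print(k, round(sum(shared[:k]), 3))  # -> 3 0.519
```

With these made-up eigenvalues the three retained PCs would explain about 52% of the shared variance.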
We have 3 eigenvalues remaining, so according to the criterion 3
Principal Components are selected: the 3 that contribute most to the
shared variance, namely PC1, PC2 and PC3.
The next step is to evaluate which variables constitute the major
part of each PC.
# Running the PCA for 3 factors
# Contributions for PC1
var <- get_pca_var(fact_1) #variable extraction
a <- fviz_contrib(fact_1, "var", axes = 1, xtickslab.rt = 90)
print(plot(a, main = "Variables percentage contribution to the first Principal Component"))
For PC1 the variables that contribute most to the component are Meat,
Wines, Fish, Income, Total_Spent, NumCatalogPurchases, Fruits, Sweets,
NumWebVisitsMonth, NumStorePurchases, Family_Size and Children_Count.
# Contributions for PC2
b <- fviz_contrib(fact_1, "var", axes = 2, xtickslab.rt = 90)
print(plot(b, main = "Variables percentage contribution to the second Principal Component"))
For PC2 the main contributors are NumDealsPurchases, Family_Size,
NumWebPurchases, Children_Count, Is_Parent, Wines and
Days_of_register.
# Contributions for PC3
c <- fviz_contrib(fact_1, "var", axes = 3 , xtickslab.rt = 90)
print(plot(c, main = "Variables percentage contribution to the third Principal Component"))
For PC3 the largest contributors are Response, Days_of_register,
NumWebVisitsMonth, Living_With_Partner, Family_Size and
Total_AcceptedCmp.
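The contribution percentages that fviz_contrib reports are just squared loadings normalized per component. A small Python sketch with made-up loadings (not the fitted values from this PCA):

```python
# Hypothetical PC1 loadings; contribution of a variable to a component
# is its squared loading over the sum of squared loadings, times 100
loadings_pc1 = {"Income": 0.30, "Wines": 0.32, "Meat": 0.33, "Recency": 0.02}
total = sum(v ** 2 for v in loadings_pc1.values())
contrib = {name: 100 * v ** 2 / total for name, v in loadings_pc1.items()}
top = max(contrib, key=contrib.get)
print(top)  # -> Meat
```

Variables with near-zero loadings (Recency here) end up with negligible contribution, which is why they sit at the tail of the bar charts.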
Time to run a scatterplot of the PCs with k = 3 and apply the Hopkins
test to see whether the dataset has a tendency to form clusters. Why
do this? Because clustering methods will return clusters even if the
data does not contain any.
The Hopkins statistic (Lawson and Jurs 1990) assesses the clustering
tendency of a data set by measuring the probability that the data set
was generated by a uniform distribution. In other words, it tests the
spatial randomness of the data.
We use 0.5 as the threshold: if H < 0.5, it is unlikely that the
data has statistically significant clusters. Conversely, if the
Hopkins statistic is close to 1, we can reject the null hypothesis and
conclude that the dataset is significantly clusterable.
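The statistic described above can be sketched in plain Python. This is a toy 2-D example with two tight blobs, not the project's data:

```python
import random

def hopkins(data, m, seed=1):
    """Hopkins statistic H = sum(u) / (sum(u) + sum(w)): u are nearest-
    neighbour distances from m uniform points to the data, w are
    nearest-neighbour distances from m sampled data points to the rest
    of the data. H close to 1 suggests a clustering tendency."""
    rng = random.Random(seed)
    dims = range(len(data[0]))
    lo = [min(p[d] for p in data) for d in dims]
    hi = [max(p[d] for p in data) for d in dims]

    def nn(p, pts):
        return min(sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
                   for q in pts)

    u = [nn([rng.uniform(lo[d], hi[d]) for d in dims], data)
         for _ in range(m)]
    sample = rng.sample(data, m)
    w = [nn(p, [q for q in data if q is not p]) for p in sample]
    return sum(u) / (sum(u) + sum(w))

# Two tight, well-separated blobs: H should land well above 0.5
r = random.Random(0)
blob = lambda cx, cy: [(cx + r.gauss(0, 0.05), cy + r.gauss(0, 0.05))
                       for _ in range(50)]
data = blob(0, 0) + blob(5, 5)
H = hopkins(data, m=10)
print(H > 0.5)  # -> True
```

Note that the clustertend package reports 1 - H, so its output for clusterable data like this would be close to 0.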
pca2 <-prcomp(market_campaign_std, center=FALSE, scale.=FALSE, rank. = 3) # stats::
results <- as.data.frame(pca2$x)
print(hop_stat <- clustertend:::hopkins(results, n = ceiling(nrow(results)/10)))
## Warning: Package `clustertend` is deprecated. Use package `hopkins` instead.
## $H
## [1] 0.1859002
Note that for the clustertend package, the function hopkins() returns
1 - Hopkins statistic, so the smaller the reported value, the better
the chances of clusters. Here Hstat = 1 - 0.1859002 = 0.8140998. Since
the Hopkins statistic is close to 1, we can reject the null hypothesis
and conclude that the dataset is significantly clusterable.
# Create a color gradient from green to blue
color_palette <- brewer.pal(9, "GnBu")
# Create a vector of colors with the same length as the data
colors <- rep(color_palette, length.out = nrow(results))
# Create the scatterplot with the specified color palette
scatterplot3d(results$PC1, results$PC2, results$PC3,
pch = 16, main = "3D Scatter Plot",
xlab = "PC1", ylab = "PC2", zlab = "PC3",
color = colors)
At this point it is not yet possible to differentiate any cluster.
Now that we have reduced the dimensions to 3 using the magic of
“DimensioNimbus” and “PCAtronus” we can jump to Clustering.
Call me Hermione DataChangers 🧙️
# Elbow Method using results from pca2 as input
fviz_nbclust(results, kmeans, method ="wss", k.max = 20) +
geom_vline(xintercept = 4, linetype = "dashed", color = "red" )
By the Elbow Chart the optimal number of clusters (k) is 4, which is
where the slope flattens. Let’s fit the K-means clustering model to
obtain the final clusters.
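The elbow idea (watch the within-cluster sum of squares fall as k grows) can be sketched with a bare-bones Lloyd's algorithm in Python; the data below is a toy set of four blobs, not the PCA scores:

```python
import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=50, seed=2):
    """Plain Lloyd's algorithm; returns (labels, within-cluster SS)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(x) / len(members)
                                   for x in zip(*members))
    wss = sum(dist2(p, centers[l]) for p, l in zip(points, labels))
    return labels, wss

# Four tight blobs: WSS drops steeply until k reaches the true count
rng = random.Random(0)
data = [(cx + rng.gauss(0, 0.1), cy + rng.gauss(0, 0.1))
        for cx, cy in [(0, 0), (4, 0), (0, 4), (4, 4)] for _ in range(30)]
wss_by_k = {k: kmeans(data, k)[1] for k in range(1, 7)}
print(wss_by_k[4] < wss_by_k[1])  # -> True
```

Plotting wss_by_k against k reproduces the elbow curve: the decrease beyond the true number of clusters is marginal.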
# Calculates the matrix
dm <- dist(results)
hc <- hclust(dm, method = "complete")
coeficientes <- sort(hc$height, decreasing = FALSE)
##K-means
The next step is to choose the most suitable distance metric.
Clustering will be conducted for two different distance measures,
chosen because of the data characteristics (continuous, containing
negative values).
# Generating the model
kmeans_model <- eclust(results, "kmeans", hc_metric = "euclidean", k = 4)
fviz_silhouette(kmeans_model)
## cluster size ave.sil.width
## 1 1 542 0.27
## 2 2 421 0.22
## 3 3 515 0.42
## 4 4 738 0.38
#### Clustering Raw data
km1 <- eclust(market_campaign_std, "kmeans", hc_metric = "euclidean", k = 4)
fviz_silhouette(km1)
## cluster size ave.sil.width
## 1 1 552 0.06
## 2 2 374 0.10
## 3 3 504 0.12
## 4 4 786 0.22
Analysing the results above, clustering on the raw data shows a
clearly smaller average silhouette width. Thus, the PCA step helped
and improved the final results of the K-means clustering.
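The average silhouette width that fviz_silhouette reports can be sketched in Python on toy points; s(i) = (b - a) / max(a, b), where a is the mean distance to the point's own cluster and b the smallest mean distance to any other cluster:

```python
def silhouette(points, labels):
    """Mean silhouette width over all points."""
    def d(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    clusters = {l: [p for p, m in zip(points, labels) if m == l]
                for l in set(labels)}
    scores = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q is not p]
        if not own:                 # singleton cluster: s(i) = 0
            scores.append(0.0)
            continue
        a = sum(d(p, q) for q in own) / len(own)
        b = min(sum(d(p, q) for q in c) / len(c)
                for m, c in clusters.items() if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two well-separated toy clusters score close to 1
pts = [(0, 0), (0, .1), (.1, 0), (5, 5), (5, 5.1), (5.1, 5)]
labs = [1, 1, 1, 2, 2, 2]
print(round(silhouette(pts, labs), 2))  # -> 0.98
```

The cluster-level averages in the R output (0.22 to 0.42 on the PCA scores vs. 0.06 to 0.22 on the raw data) are this same quantity averaged within each cluster.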
data <- as.data.frame(kmeans_model$data)
cluster_labels <- kmeans_model$cluster
scatterplot3d(data$PC1, data$PC2, data$PC3,
color = cluster_labels, pch = 16, main = "3D Scatter Plot",
xlab = "PC1", ylab = "PC2", zlab = "PC3")
### Evaluating the models
cluster_rank <- as.data.frame(kmeans_model$cluster)
colnames(cluster_rank) <- "cluster"
cluster_rank %>%
ggplot(aes(x = factor(cluster), fill = factor(cluster))) +
geom_bar() +
labs(x = "cluster", title = "Distribution of the Clusters") +
theme_minimal() +
theme(plot.title = element_text(size = 14, face = "bold"),
legend.position = "none")
The clusters seem to be fairly evenly distributed. Let’s evaluate the
cluster behavior in relation to Total_Spent and Income.
market_campaign$cluster <- kmeans_model$cluster
market_campaign <- as.data.frame(market_campaign)
market_campaign %>%
ggplot(aes(x = Total_Spent , y = Income, color = factor(cluster))) +
geom_point() +
scale_color_manual(values = c("darkblue", "pink", "gold", "darkorchid")) +
labs(title = "Cluster's Profile Based On Income And Spending") +
theme_gray()
The income vs. spending plot shows the clusters’ pattern.
# The six product plots below are identical for every cluster (the
# per-cluster subset was never used), so build them once with a helper
# over the product columns
product_vars <- c("Wines", "Fruits", "Meat", "Fish", "Sweet", "Gold")
component_plots <- lapply(product_vars, function(var) {
ggplot(market_campaign, aes(x = .data[[var]], y = Total_Spent, color = factor(cluster))) +
geom_point() +
scale_color_manual(values = c("darkblue", "pink", "gold", "darkorchid")) +
labs(title = paste("Cluster's Profile", var)) +
theme_gray()
})
# Combine the component plots using grid.arrange
grid.arrange(grobs = component_plots, ncol = 2)
# Next step: use Total_AcceptedCmp, the total number of promotions accepted
market_campaign%>%
ggplot(aes(x = Total_AcceptedCmp, fill = factor(cluster))) +
geom_bar() +
labs( x = "Number of Offers Accepted", y = "Total Count")+
theme_gray()
There has not been an overwhelming response to the campaigns so far:
very few participants overall, and no one took part in all 5 of them.
Perhaps better-targeted and well-planned campaigns are required to
boost sales.
market_campaign %>%
ggplot(aes(y = Total_Spent, x = factor(cluster), fill = factor(cluster))) +
geom_boxplot()+
theme_grey()
From the above plot, it can clearly be seen that cluster 3 is our
biggest set of customers in terms of expenditure. We can explore what
each cluster spends on for targeted marketing strategies.
market_campaign %>%
ggplot(aes(y = NumDealsPurchases, x = factor(cluster), fill = factor(cluster))) +
geom_boxplot()+
theme_grey()
Unlike the campaigns, the deals offered did have the best outcome with
clusters 1 and 4, but not with our star customers in cluster 3.
Nothing seems to attract cluster 2 overwhelmingly.
Profiling involves generating descriptions of the clusters with
reference to the input variables you used for the cluster analysis.
Profiling acts as a class descriptor for the clusters and will help you
to ‘tell a story’ so that you can understand this information and use it
across your business.
To do that, I will plot some of the features indicative of the customers’ personal traits in light of the cluster they are in, and draw conclusions from the outcomes.
grouped_data <- market_campaign %>%
group_by(cluster) %>%
summarise(avg_children = mean(Children_Count), avg_fam = mean(Family_Size) )
plot_list <- lapply(unique(grouped_data$cluster), function(cl) {
data <- subset(grouped_data, cluster == cl) # cl avoids masking the column name
ggplot(data, aes( x = cluster , y = avg_children, fill = factor(cluster))) +
geom_col(position = "stack") +
labs(y = "Average Children Count", title = "Average Children Count by Cluster") +
theme_gray() +
facet_wrap(~ cluster, ncol = 2)
})
plot_list
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
pl2 <- lapply(unique(grouped_data$cluster), function(cl) {
data <- subset(grouped_data, cluster == cl) # cl avoids masking the column name
ggplot(data, aes( x = cluster , y = avg_fam, fill = factor(cluster))) +
geom_col(position = "stack") +
labs(y = "Average Family Size", title = "Average Family Size by Cluster") +
theme_gray() +
facet_wrap(~ cluster, ncol = 2)
})
pl2
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
customer_data <- c("Is_Parent", "Living_With_Partner")
pl3 <- lapply(customer_data, function(var) {
ggplot(market_campaign, aes( y = factor(.data[[var]]), fill = factor(cluster))) +
geom_bar() +
labs(x = "Count", y = var)+
theme_gray()
})
grid.arrange(grobs = pl3, ncol = 1)
market_campaign_1 <- market_campaign %>% filter(cluster == 1)
plot_data_1 <- market_campaign_1 %>%
ggplot(aes(y = age)) +
geom_bar(fill = "pink") +
labs( title = "Age Distribution for Cluster 1")+
theme_gray()
market_campaign_2 <- market_campaign %>% filter(cluster == 2)
plot_data_2 <- market_campaign_2 %>%
ggplot(aes(y = age)) +
geom_bar(fill = "purple") +
labs( title = "Age Distribution for Cluster 2")+
theme_gray()
market_campaign_3 <- market_campaign %>% filter(cluster == 3)
plot_data_3 <- market_campaign_3 %>%
ggplot(aes(y = age)) +
geom_bar(fill = "darkblue") +
labs( title = "Age Distribution for Cluster 3")+
theme_gray()
market_campaign_4 <- market_campaign %>% filter(cluster == 4)
plot_data_4 <- market_campaign_4 %>%
ggplot(aes(y = age)) +
geom_bar(fill = "yellow") +
labs( title = "Age Distribution for Cluster 4")+
theme_gray()
grid.arrange(plot_data_1, plot_data_2, plot_data_3, plot_data_4, ncol = 2)
grouped_data_4 <- market_campaign %>%
group_by(cluster) %>%
summarise(avg_days = mean(Days_of_register))
plot_data_21 <- grouped_data_4 %>%
ggplot(aes(y = avg_days, x = cluster, fill = factor(cluster))) +
geom_col() +
theme_gray()
plot_data_21
plots_1 <- market_campaign %>%
mutate(cluster = as.factor(cluster)) %>%
ggplot() +
geom_density_2d(aes(x = Education, y = Total_Spent, color = cluster, group = cluster), alpha = 0.5, na.rm = FALSE) +
scale_color_manual(values = c("deeppink", "green3", "dodgerblue2","darkorchid1")) +
labs(x = "Education", y = "Total Spent", title = "Educational Profile vs. Total Spent") +
theme_bw()
plots_1
Cluster 1:
- Average Children Count: second in this criterion, surpassed only by cluster 4.
- Average Family Size: at most 4 members; again second only to cluster 4.
- Living With a Partner: single parents are a subset of this group, but the “Together” status predominates.
- Age: most consumers are between 30 and 65 y.o.
- Days_of_register: contains the consumers registered for the longest period.
Cluster 2:
- Family Size: maximum value 3.
- Is_Parent: the majority of these consumers are parents.
- Living With a Partner: single parents are a subset of this group.
- Age: relatively younger, concentrated between 20 and 45 y.o.
Cluster 3:
- Average Children Count: the smallest children count.
- Is_Parent: mainly composed of non-parents.
- Average Family Size: the smallest family sizes, at most 2 members.
- Living With a Partner: a slight majority live as a couple rather than single.
- Age: more dispersed, 30 to 60 y.o.
- Education Level: among consumers with educational level 3, this is the cluster that tends to spend the most.
Cluster 4:
- Is_Parent: yes, for the majority.
- Average Children Count: the largest among all clusters.
- Average Family Size: the largest on average, with 2 to 5 family members.
- Living With a Partner: yes, for the majority.
- Age: concentrated between 25 and 70 y.o.
- Days_of_register: the smallest, around 300 days.
- Income: the lower income group.
In this project, I performed unsupervised clustering: dimensionality reduction using PCA followed by K-means. The final model ended up with 4 clusters, which were used to profile customers according to their family structure, income and spending. This can be used to plan better marketing strategies.